library(plyr)
library(tidyverse)
library(lubridate)
library(plotly)
library(caret)
library(e1071)
library(tree)
#Online Datasource
#original <- read.csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-06.csv')
original <- read.csv("yellow_tripdata_2020-06.csv")
Investigating the summary of the dataset, shows that there are quite a number of negative values in various instances of the columns which could be possible outliers and thus will cause errors in the data as further analysis proceeds. Therefore, we shall need to deal with these negative values and any other possible outliers.
summary(original)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
Min. :1.0 Length:549760 Length:549760 Min. :0.00
1st Qu.:1.0 Class :character Class :character 1st Qu.:1.00
Median :2.0 Mode :character Mode :character Median :1.00
Mean :1.6 Mean :1.36
3rd Qu.:2.0 3rd Qu.:1.00
Max. :2.0 Max. :9.00
NA's :50717 NA's :50717
trip_distance RatecodeID store_and_fwd_flag PULocationID
Min. : 0.00 Min. : 1.00 Length:549760 Min. : 1.0
1st Qu.: 1.01 1st Qu.: 1.00 Class :character 1st Qu.:107.0
Median : 1.86 Median : 1.00 Mode :character Median :151.0
Mean : 4.10 Mean : 1.05 Mean :157.6
3rd Qu.: 3.66 3rd Qu.: 1.00 3rd Qu.:234.0
Max. :220386.23 Max. :99.00 Max. :265.0
NA's :50717
DOLocationID payment_type fare_amount extra
Min. : 1.0 Min. :1.00 Min. :-216.00 Min. :-4.500
1st Qu.: 87.0 1st Qu.:1.00 1st Qu.: 6.00 1st Qu.: 0.000
Median :151.0 Median :1.00 Median : 9.00 Median : 0.500
Mean :153.5 Mean :1.37 Mean : 13.61 Mean : 1.024
3rd Qu.:233.0 3rd Qu.:2.00 3rd Qu.: 15.50 3rd Qu.: 2.500
Max. :265.0 Max. :5.00 Max. : 941.50 Max. :87.560
NA's :50717
mta_tax tip_amount tolls_amount improvement_surcharge
Min. :-0.5000 Min. :-36.300 Min. :-28.7500 Min. :-0.300
1st Qu.: 0.5000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.300
Median : 0.5000 Median : 1.500 Median : 0.0000 Median : 0.300
Mean : 0.4913 Mean : 1.763 Mean : 0.3671 Mean : 0.297
3rd Qu.: 0.5000 3rd Qu.: 2.750 3rd Qu.: 0.0000 3rd Qu.: 0.300
Max. : 3.3000 Max. :422.680 Max. :114.7500 Max. : 0.300
total_amount congestion_surcharge
Min. :-216.30 Min. :-2.500
1st Qu.: 10.70 1st Qu.: 2.500
Median : 14.16 Median : 2.500
Mean : 18.77 Mean : 1.968
3rd Qu.: 20.80 3rd Qu.: 2.500
Max. :1141.10 Max. : 2.500
Count all the null values
sapply(original, function(y) sum(length(which(is.na(y)))))
VendorID tpep_pickup_datetime tpep_dropoff_datetime
50717 0 0
passenger_count trip_distance RatecodeID
50717 0 50717
store_and_fwd_flag PULocationID DOLocationID
0 0 0
payment_type fare_amount extra
50717 0 0
mta_tax tip_amount tolls_amount
0 0 0
improvement_surcharge total_amount congestion_surcharge
0 0 0
Drop all the rows with null entries
original<-drop_na(original)
Already investigating, we come across the fact that the max value in “Trip distance” is a quite huge, thus we have identified an outlier.
head(sort(original$trip_distance, decreasing = T),10)
[1] 22543.99 441.60 270.32 259.13 240.70 191.80 187.90 168.70
[9] 167.50 167.10
Thus we drop that value
original <- subset.data.frame(original, original$trip_distance != 22543.99, drop = TRUE)
While looking through the dataset, we spot that the “RatecodeID” field has a value 99 that is not described in the data dictionary.
unique(original$RatecodeID)
[1] 1 2 3 5 4 99 6
print(paste("There are",sum(original$RatecodeID == 99),"rows with the value 99 in them"))
[1] "There are 57 rows with the value 99 in them"
Therefore, we remove any rows that conform to this condition
original <- subset.data.frame(original, original$RatecodeID != 99, drop = TRUE)
The data dictionary describes a value known as “Unknown” payment type and as we do not have information as to how the passenger(s) paid for their trip, we drop it
sum(original$payment_type == 5)
[1] 12
original <- subset.data.frame(original, original$payment_type != 5, drop = TRUE)
Passenger count shows that there are trips with 0 passengers as this is not feasible. As well as trips that had over 7 passengers. We shall remove the trips containing them.
unique(original$passenger_count)
[1] 1 2 0 6 3 4 5 9 8
print(paste("There are",sum(original$passenger_count %in% c(0,7,8,9)),"rows with the value of 0,7,8 and 9 passengers in them"))
[1] "There are 13285 rows with the value of 0,7,8 and 9 passengers in them"
original <- subset.data.frame(original, original$passenger_count %!in% c(0,7,8,9), drop = TRUE)
Dealing with negative values in the dataset
print(paste("There are",length(original[original < 0]),"rows with negative values in them"))
[1] "There are 13035 rows with negative values in them"
We proceed to replace all the negative values with NA
original <- replace(original,original < 0,NA)
And then drop the rows
original <- drop_na(original)
summary(select_if(original, is.numeric))
VendorID passenger_count trip_distance RatecodeID
Min. :1.000 Min. :1.000 Min. : 0.000 Min. :1.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:1.000
Median :2.000 Median :1.000 Median : 1.720 Median :1.000
Mean :1.613 Mean :1.394 Mean : 2.828 Mean :1.035
3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.: 3.150 3rd Qu.:1.000
Max. :2.000 Max. :6.000 Max. :441.600 Max. :6.000
PULocationID DOLocationID payment_type fare_amount
Min. : 1.0 Min. : 1.0 Min. :1.000 Min. : 0.00
1st Qu.:107.0 1st Qu.: 90.0 1st Qu.:1.000 1st Qu.: 6.00
Median :161.0 Median :155.0 Median :1.000 Median : 8.50
Mean :160.2 Mean :155.5 Mean :1.362 Mean : 11.98
3rd Qu.:234.0 3rd Qu.:234.0 3rd Qu.:2.000 3rd Qu.: 13.50
Max. :265.0 Max. :265.0 Max. :4.000 Max. :941.50
extra mta_tax tip_amount tolls_amount
Min. : 0.000 Min. :0.0000 Min. : 0.000 Min. : 0.0000
1st Qu.: 0.000 1st Qu.:0.5000 1st Qu.: 0.000 1st Qu.: 0.0000
Median : 0.500 Median :0.5000 Median : 1.500 Median : 0.0000
Mean : 1.095 Mean :0.4966 Mean : 1.782 Mean : 0.2237
3rd Qu.: 2.500 3rd Qu.:0.5000 3rd Qu.: 2.660 3rd Qu.: 0.0000
Max. :87.560 Max. :0.5000 Max. :422.680 Max. :114.7500
improvement_surcharge total_amount congestion_surcharge
Min. :0.0000 Min. : 0.00 Min. :0.000
1st Qu.:0.3000 1st Qu.: 10.35 1st Qu.:2.500
Median :0.3000 Median : 13.56 Median :2.500
Mean :0.2997 Mean : 17.20 Mean :2.123
3rd Qu.:0.3000 3rd Qu.: 18.96 3rd Qu.:2.500
Max. :0.3000 Max. :1141.10 Max. :2.500
For reproducibility we set the seed
set.seed(100518243)
We proceed to have a random selection of our data to narrow it down to 50,000 rows
index <- sample(1:nrow(original),50000)
june2020 <- (original[index,])
attach(june2020)
The following objects are masked from june2020 (pos = 3):
congestion_surcharge, DOLocationID, extra, fare_amount,
improvement_surcharge, mta_tax, passenger_count, payment_type,
PULocationID, RatecodeID, store_and_fwd_flag, tip_amount,
tolls_amount, total_amount, tpep_dropoff_datetime,
tpep_pickup_datetime, trip_distance, VendorID
dim(june2020)
[1] 50000 18
Transform the datetime columns from character to datetime data types
june2020$tpep_pickup_datetime <- ymd_hms(june2020$tpep_pickup_datetime)
june2020$tpep_dropoff_datetime <- ymd_hms(june2020$tpep_dropoff_datetime)
Convert columns to categorical factors
june2020$store_and_fwd_flag <- parse_factor(june2020$store_and_fwd_flag)
june2020$payment_type <- factor(june2020$payment_type)
june2020$VendorID <- factor(june2020$VendorID)
june2020$RatecodeID <- factor(june2020$RatecodeID)
We assign terms to the categorical columns
june2020$payment_type <- mapvalues(payment_type, from = c("1", "2", "3","4"), to = c("Credit Card", "Cash","No charge","Dispute"))
june2020$VendorID <- mapvalues(VendorID, from = c("1", "2"), to = c("Creative Mobile Technologies", "VeriFone Inc"))
#june2020$RatecodeID <- mapvalues(RatecodeID, from = c("1", "2", "3","4","5"), to = c("Standard Rate", "JFK","Newark","Nassau or Westchester","Negotiated fare"))
We proceed to extract data relating to day,month and day of the week from the datetime columns
june2020$pickup_day <- factor(day(tpep_pickup_datetime))
june2020$pickup_dayofweek <- factor(wday(tpep_pickup_datetime, label = TRUE))
june2020$dropoff_day <- factor(day(tpep_dropoff_datetime))
june2020$dropoff_dayofweek <- factor(wday(tpep_dropoff_datetime, label = TRUE))
june2020$pickup_hour <- factor(hour(tpep_pickup_datetime))
june2020$dropoff_hour <- factor(hour(tpep_dropoff_datetime))
june2020$ride_duration <- as.numeric(june2020$tpep_dropoff_datetime-june2020$tpep_pickup_datetime)
summary(june2020)
VendorID tpep_pickup_datetime tpep_dropoff_datetime
Length:50000 Min. :2020-05-31 17:00:08 Min. :2020-06-01 00:04:27
Class :character 1st Qu.:2020-06-10 11:49:22 1st Qu.:2020-06-10 11:59:28
Mode :character Median :2020-06-18 11:40:02 Median :2020-06-18 11:50:49
Mean :2020-06-17 14:13:37 Mean :2020-06-17 14:27:40
3rd Qu.:2020-06-25 00:21:33 3rd Qu.:2020-06-25 00:39:37
Max. :2020-07-01 00:02:40 Max. :2020-07-01 13:39:22
passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID
Min. :1.000 Min. : 0.000 1:49085 N:49647 Min. : 1.0
1st Qu.:1.000 1st Qu.: 1.000 2: 522 Y: 353 1st Qu.:107.0
Median :1.000 Median : 1.730 3: 64 Median :161.0
Mean :1.395 Mean : 2.876 4: 57 Mean :160.4
3rd Qu.:1.000 3rd Qu.: 3.200 5: 272 3rd Qu.:234.0
Max. :6.000 Max. :168.700 Max. :265.0
DOLocationID payment_type fare_amount extra
Min. : 1 Length:50000 Min. : 0.00 Min. : 0.000
1st Qu.: 90 Class :character 1st Qu.: 6.00 1st Qu.: 0.000
Median :158 Mode :character Median : 8.50 Median : 0.500
Mean :156 Mean : 12.16 Mean : 1.091
3rd Qu.:234 3rd Qu.: 13.50 3rd Qu.: 2.500
Max. :265 Max. :510.00 Max. :22.700
mta_tax tip_amount tolls_amount improvement_surcharge
Min. :0.0000 Min. : 0.000 Min. : 0.0000 Min. :0.0000
1st Qu.:0.5000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:0.3000
Median :0.5000 Median : 1.500 Median : 0.0000 Median :0.3000
Mean :0.4964 Mean : 1.797 Mean : 0.2422 Mean :0.2997
3rd Qu.:0.5000 3rd Qu.: 2.660 3rd Qu.: 0.0000 3rd Qu.:0.3000
Max. :0.5000 Max. :117.930 Max. :81.4800 Max. :0.3000
total_amount congestion_surcharge pickup_day pickup_dayofweek
Min. : 0.00 Min. :0.000 26 : 2491 Sun:4485
1st Qu.: 10.38 1st Qu.:2.500 25 : 2368 Mon:8973
Median : 13.56 Median :2.500 23 : 2322 Tue:9385
Mean : 17.42 Mean :2.119 30 : 2309 Wed:7000
3rd Qu.: 19.20 3rd Qu.:2.500 29 : 2288 Thu:7116
Max. :627.35 Max. :2.500 24 : 2278 Fri:7764
(Other):35944 Sat:5277
dropoff_day dropoff_dayofweek pickup_hour dropoff_hour
26 : 2488 Sun:4493 15 : 3837 15 : 3806
25 : 2371 Mon:8967 14 : 3811 13 : 3757
23 : 2321 Tue:9376 13 : 3768 14 : 3744
30 : 2309 Wed:7011 17 : 3685 17 : 3689
29 : 2288 Thu:7122 16 : 3636 16 : 3681
24 : 2280 Fri:7756 12 : 3627 12 : 3572
(Other):35943 Sat:5275 (Other):27636 (Other):27751
ride_duration
Min. : 0.0
1st Qu.: 320.0
Median : 534.0
Mean : 843.2
3rd Qu.: 871.0
Max. :161250.0
head(june2020)
summary(june2020$ride_duration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 320.0 534.0 843.2 871.0 161250.0
The maximum duration for a ride is shown as 161250 seconds, which is nearly 44 hours as such we remove any rows whose duration exceed over 2 hours or 7200 seconds.
june2020 <- subset.data.frame(june2020, june2020$ride_duration <= 7200, drop=TRUE)
#After removing rides with longer than 2 hours
summary(june2020$ride_duration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 320.0 532.0 686.3 868.0 6948.0
ggplotly(ggplot(june2020,aes(ride_duration))+
geom_density(stat='count'))
ggplotly(ggplot(data = june2020) + geom_bar(mapping = aes(x= passenger_count), fill="blue") + ylab("Trip count") + xlab("Passenger count") + ggtitle("A distribution of passenger count") )
ggplotly(ggplot(data = june2020) +
geom_bar(mapping = aes(x = VendorID),fill= c("orange", "red")) + ylab("Trip count"))
ggplotly(ggplot(data = june2020) +
geom_bar(mapping = aes(x = payment_type, fill=payment_type))+ ylab("Trip count"))
ggplotly(ggplot(data = june2020) +
geom_bar(mapping = aes(x = RatecodeID, fill=RatecodeID)) + ylab("Trip count"), tooltip = c("y", "x")+ theme(axis.text.x = element_text(angle = 45)))
ggplotly(ggplot(data = june2020) + geom_bar(mapping = aes(x=pickup_dayofweek, fill=pickup_dayofweek)) + ggtitle("Pick Up Days of the week"))
ggplotly(ggplot(data = june2020) + geom_bar(mapping = aes(x=dropoff_dayofweek, fill=dropoff_dayofweek)) + ggtitle("Drop Off Days of the week"))
ggplotly(ggplot(data = june2020) + geom_bar(mapping = aes(x=pickup_hour, fill=pickup_hour)) + ggtitle("Pick Up Hours of the week"))
ggplotly(ggplot(data = june2020) + geom_bar(mapping = aes(x=dropoff_hour, fill=dropoff_hour)) + ggtitle("Drop Off Hours of the week"))
ggplotly(ggplot(data=june2020, aes(x=pickup_hour, fill=payment_type)) + geom_bar())
ggplot(june2020, aes(x = pickup_hour, y = total_amount)) +
geom_boxplot()
ggplot(june2020,aes(trip_distance))+
geom_density(stat='count')+
xlim(0,30)+ylim(0,1000)
ggplot(june2020,aes(total_amount))+
geom_density(stat='count')+
xlim(0,150)
ggplot(june2020, aes(x=trip_distance, y=total_amount))+geom_point()+ geom_smooth(method = lm)
For the purpose of supervised learning, we shall look to predict the total_amount taxi fare based on the set of predictors. In order to better predict the total values, we shall be eliminating the negative values
First we shall split the data into training and test datasets
trainRowIndex <- sample(1:nrow(june2020), 0.7*nrow(june2020))
trainData <- june2020[trainRowIndex,]
testData <- june2020[-trainRowIndex,]
trainTdDistX <- trainData[,-5]
trainTDistY <- trainData$trip_distance
testTDistX <- testData[,-5]
testDistY <- testData$trip_distance
tripDistData <- cbind(trainTdDistX,trainTDistY)
dist.regTree <- tree(trainTDistY~.,data = tripDistData)
NAs introduced by coercion
summary(dist.regTree)
Regression tree:
tree(formula = trainTDistY ~ ., data = tripDistData)
Variables actually used in tree construction:
[1] "total_amount" "fare_amount" "ride_duration"
Number of terminal nodes: 8
Residual mean deviance: 2.53 = 88320 / 34910
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-18.03000 -0.36380 -0.00376 0.00000 0.39620 43.13000
plot(dist.regTree)
text(dist.regTree,pretty=0)
dist.regTreePred <- predict(dist.regTree,testTDistX)
NAs introduced by coercion
treePred <- data.frame(cbind(actual=testDistY, predicted=dist.regTreePred))
amt.regTreeAcc <- cor(treePred)
amt.regTreeAcc
actual predicted
actual 1.0000000 0.9013809
predicted 0.9013809 1.0000000
From this we are able to garner an accuracy prediction score of 90.1% using a regression tree to predict the trip distance.
rateCluster <- subset(june2020,select = c(RatecodeID,PULocationID,DOLocationID))
rateKM <- kmeans(rateCluster,5)
A good clustering, will have a lower value of withinss and higher value of betweenss which depends on the number of clusters ‘k’ chosen initially
str(rateKM)
List of 9
$ cluster : Named int [1:49888] 3 4 1 4 3 1 3 3 3 3 ...
..- attr(*, "names")= chr [1:49888] "186219" "411035" "115230" "62137" ...
$ centers : num [1:5, 1:3] 1.04 1.03 1.05 1.02 1.05 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "RatecodeID" "PULocationID" "DOLocationID"
$ totss : num 5.06e+08
$ withinss : num [1:5] 30925765 25681775 18353804 13405842 20866497
$ tot.withinss: num 1.09e+08
$ betweenss : num 3.97e+08
$ size : int [1:5] 13051 11658 8031 6770 10378
$ iter : int 4
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
rateClusterDf <- data.frame(rateCluster, as.factor(rateKM$cluster))
ggplot(rateClusterDf, aes(x=PULocationID, y=DOLocationID)) + geom_point(mapping = aes(color=as.factor(rateKM$cluster)))+labs(color = "RateCodeID(cluster)") +ggtitle("K-Means Clustering of Pickup and Drop Off Locations")
x_VendorTrain <- trainData[,-1]
y_VendorTrain <- as.factor(trainData$VendorID)
x_VendorTest <- testData[,-1]
y_VendorTest <- as.factor(testData$VendorID)
vendorClassData <- cbind(x_VendorTrain,y_VendorTrain)
vendorClass <- naiveBayes(y_VendorTrain~.,data = vendorClassData)
summary(vendorClass)
Length Class Mode
apriori 2 table numeric
tables 24 -none- list
levels 2 -none- character
isnumeric 24 -none- logical
call 4 -none- call
vendorPred <- predict(vendorClass,x_VendorTest)
confusionMatrix(y_VendorTest,vendorPred)
Confusion Matrix and Statistics
Reference
Prediction Creative Mobile Technologies VeriFone Inc
Creative Mobile Technologies 4993 754
VeriFone Inc 555 8665
Accuracy : 0.9125
95% CI : (0.9079, 0.917)
No Information Rate : 0.6293
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8139
Mcnemar's Test P-Value : 4.434e-08
Sensitivity : 0.9000
Specificity : 0.9199
Pos Pred Value : 0.8688
Neg Pred Value : 0.9398
Prevalence : 0.3707
Detection Rate : 0.3336
Detection Prevalence : 0.3840
Balanced Accuracy : 0.9100
'Positive' Class : Creative Mobile Technologies
Create our Train &
trainAmtX <- trainData[,-17]
trainAmtY <- trainData$total_amount
Test Data
testAmtX <- testData[,-17]
testAmtY <- testData$total_amount
testAmt <- cbind(testAmtX,testAmtY)
testAmt<- subset(testAmt,testAmt$pickup_day != 31, drop = T)
testAmtX <- testAmt[,-25]
testAmtY <- testAmt$testAmtY
totAmtData <- cbind(trainAmtX,trainAmtY)
Fit the Linear Model
lm.fit <- lm(trainAmtY~.,data = totAmtData)
summary(lm.fit)
Call:
lm(formula = trainAmtY ~ ., data = totAmtData)
Residuals:
Min 1Q Median 3Q Max
-1.2913 -0.1007 -0.0059 0.0934 12.7377
Coefficients: (12 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.541e+03 2.929e+03 -1.209 0.22669
VendorIDVeriFone Inc 7.080e-01 6.847e-03 103.402 < 2e-16 ***
tpep_pickup_datetime -4.119e-05 5.025e-06 -8.197 2.55e-16 ***
tpep_dropoff_datetime 4.341e-05 5.030e-06 8.631 < 2e-16 ***
passenger_count 2.159e-03 1.521e-03 1.420 0.15562
trip_distance -7.894e-03 8.619e-04 -9.159 < 2e-16 ***
RatecodeID2 4.779e-01 1.648e-02 28.996 < 2e-16 ***
RatecodeID3 -1.814e-02 5.364e-02 -0.338 0.73522
RatecodeID4 -7.322e-02 4.418e-02 -1.657 0.09745 .
RatecodeID5 -2.606e-01 2.781e-02 -9.368 < 2e-16 ***
store_and_fwd_flagY 7.540e-02 1.817e-02 4.150 3.33e-05 ***
PULocationID 2.021e-05 2.256e-05 0.896 0.37028
DOLocationID -1.294e-05 2.087e-05 -0.620 0.53523
payment_typeCredit Card 3.047e-02 3.769e-03 8.086 6.38e-16 ***
payment_typeDispute 2.166e-02 3.166e-02 0.684 0.49384
payment_typeNo charge 9.577e-02 1.759e-02 5.444 5.26e-08 ***
fare_amount 1.001e+00 2.941e-04 3403.160 < 2e-16 ***
extra 3.217e-01 2.835e-03 113.458 < 2e-16 ***
mta_tax 9.632e-01 5.198e-02 18.532 < 2e-16 ***
tip_amount 9.997e-01 7.475e-04 1337.347 < 2e-16 ***
tolls_amount 1.005e+00 1.368e-03 734.091 < 2e-16 ***
improvement_surcharge 1.382e+00 1.803e-01 7.668 1.79e-14 ***
congestion_surcharge 8.633e-01 2.170e-03 397.746 < 2e-16 ***
pickup_day2 2.738e+00 2.327e+00 1.177 0.23926
pickup_day3 6.150e+00 4.631e+00 1.328 0.18421
pickup_day4 9.153e+00 6.940e+00 1.319 0.18719
pickup_day5 1.223e+01 9.248e+00 1.322 0.18613
pickup_day6 1.484e+01 1.156e+01 1.284 0.19928
pickup_day7 -2.748e+00 3.041e+00 -0.904 0.36609
pickup_day8 -8.039e+00 4.536e+00 -1.772 0.07637 .
pickup_day9 -4.966e+00 3.383e+00 -1.468 0.14204
pickup_day10 -1.589e+00 3.597e+00 -0.442 0.65879
pickup_day11 1.330e+00 5.016e+00 0.265 0.79085
pickup_day12 4.442e+00 6.933e+00 0.641 0.52168
pickup_day13 7.482e+00 9.035e+00 0.828 0.40761
pickup_day14 -1.026e+01 6.330e+00 -1.620 0.10519
pickup_day15 -7.421e+00 4.242e+00 -1.749 0.08022 .
pickup_day16 -4.181e+00 2.566e+00 -1.629 0.10327
pickup_day17 -1.341e+00 2.423e+00 -0.553 0.57999
pickup_day18 1.773e+00 3.977e+00 0.446 0.65570
pickup_day19 5.026e+00 6.034e+00 0.833 0.40492
pickup_day20 8.002e+00 8.228e+00 0.973 0.33075
pickup_day21 -9.800e+00 6.527e+00 -1.502 0.13322
pickup_day22 -6.760e+00 4.273e+00 -1.582 0.11361
pickup_day23 -3.762e+00 2.140e+00 -1.758 0.07876 .
pickup_day24 -6.869e-01 1.253e+00 -0.548 0.58364
pickup_day25 2.605e+00 3.039e+00 0.857 0.39148
pickup_day26 5.748e+00 5.252e+00 1.095 0.27373
pickup_day27 8.610e+00 7.522e+00 1.145 0.25235
pickup_day28 -9.130e+00 6.924e+00 -1.319 0.18732
pickup_day29 -6.154e+00 4.617e+00 -1.333 0.18253
pickup_day30 -2.931e+00 2.314e+00 -1.266 0.20541
pickup_dayofweek.L -1.561e+01 1.262e+01 -1.237 0.21617
pickup_dayofweek.Q NA NA NA NA
pickup_dayofweek.C NA NA NA NA
pickup_dayofweek^4 NA NA NA NA
pickup_dayofweek^5 NA NA NA NA
pickup_dayofweek^6 NA NA NA NA
dropoff_day2 1.150e-02 3.254e-01 0.035 0.97181
dropoff_day3 -6.674e-01 4.785e-01 -1.395 0.16306
dropoff_day4 -8.976e-01 6.386e-01 -1.405 0.15990
dropoff_day5 -1.190e+00 8.022e-01 -1.484 0.13786
dropoff_day6 -1.201e+00 9.556e-01 -1.257 0.20887
dropoff_day7 -1.514e+00 1.117e+00 -1.355 0.17549
dropoff_day8 6.699e+00 3.915e+00 1.711 0.08708 .
dropoff_day9 6.371e+00 3.745e+00 1.701 0.08894 .
dropoff_day10 5.758e+00 3.571e+00 1.613 0.10686
dropoff_day11 5.569e+00 3.403e+00 1.637 0.10173
dropoff_day12 5.225e+00 3.236e+00 1.615 0.10632
dropoff_day13 4.746e+00 3.068e+00 1.547 0.12187
dropoff_day14 4.583e+00 2.901e+00 1.580 0.11408
dropoff_day15 4.708e+00 2.733e+00 1.723 0.08497 .
dropoff_day16 4.205e+00 2.563e+00 1.641 0.10085
dropoff_day17 4.124e+00 2.396e+00 1.721 0.08531 .
dropoff_day18 3.767e+00 2.229e+00 1.690 0.09109 .
dropoff_day19 3.273e+00 2.057e+00 1.591 0.11156
dropoff_day20 2.874e+00 1.890e+00 1.521 0.12839
dropoff_day21 2.780e+00 1.723e+00 1.613 0.10667
dropoff_day22 2.689e+00 1.556e+00 1.728 0.08391 .
dropoff_day23 2.452e+00 1.387e+00 1.767 0.07718 .
dropoff_day24 2.135e+00 1.218e+00 1.753 0.07955 .
dropoff_day25 1.605e+00 1.050e+00 1.528 0.12643
dropoff_day26 1.206e+00 8.856e-01 1.362 0.17329
dropoff_day27 9.390e-01 7.204e-01 1.303 0.19245
dropoff_day28 7.404e-01 5.565e-01 1.331 0.18335
dropoff_day29 7.330e-01 3.784e-01 1.937 0.05277 .
dropoff_day30 2.797e-01 1.980e-01 1.412 0.15786
dropoff_dayofweek.L NA NA NA NA
dropoff_dayofweek.Q NA NA NA NA
dropoff_dayofweek.C NA NA NA NA
dropoff_dayofweek^4 NA NA NA NA
dropoff_dayofweek^5 NA NA NA NA
dropoff_dayofweek^6 NA NA NA NA
pickup_hour1 2.121e-02 3.821e-02 0.555 0.57895
pickup_hour2 -1.926e-02 6.232e-02 -0.309 0.75731
pickup_hour3 -8.833e-02 8.560e-02 -1.032 0.30211
pickup_hour4 -8.094e-02 1.082e-01 -0.748 0.45430
pickup_hour5 -1.011e-01 1.191e-01 -0.849 0.39580
pickup_hour6 -3.183e-01 1.242e-01 -2.563 0.01039 *
pickup_hour7 -3.186e-01 1.269e-01 -2.511 0.01205 *
pickup_hour8 -3.368e-01 1.292e-01 -2.607 0.00913 **
pickup_hour9 -3.311e-01 1.313e-01 -2.521 0.01169 *
pickup_hour10 -3.083e-01 1.333e-01 -2.312 0.02078 *
pickup_hour11 -2.936e-01 1.352e-01 -2.171 0.02997 *
pickup_hour12 -2.966e-01 1.372e-01 -2.162 0.03061 *
pickup_hour13 -3.121e-01 1.391e-01 -2.244 0.02485 *
pickup_hour14 -3.177e-01 1.411e-01 -2.252 0.02436 *
pickup_hour15 -3.163e-01 1.431e-01 -2.211 0.02707 *
pickup_hour16 1.280e-01 1.452e-01 0.881 0.37816
pickup_hour17 1.055e-01 1.475e-01 0.715 0.47450
pickup_hour18 7.485e-02 1.498e-01 0.500 0.61731
pickup_hour19 3.903e-02 1.523e-01 0.256 0.79772
pickup_hour20 -9.490e-02 1.552e-01 -0.611 0.54087
pickup_hour21 -7.649e-02 1.583e-01 -0.483 0.62895
pickup_hour22 -1.110e-02 1.620e-01 -0.068 0.94539
pickup_hour23 -4.706e-02 1.660e-01 -0.284 0.77677
dropoff_hour1 -5.093e-02 3.649e-02 -1.396 0.16277
dropoff_hour2 -1.079e-02 5.901e-02 -0.183 0.85492
dropoff_hour3 1.872e-02 8.189e-02 0.229 0.81917
dropoff_hour4 -5.786e-02 1.051e-01 -0.551 0.58188
dropoff_hour5 -4.800e-02 1.187e-01 -0.404 0.68585
dropoff_hour6 -1.713e-01 1.246e-01 -1.375 0.16924
dropoff_hour7 -1.758e-01 1.273e-01 -1.381 0.16725
dropoff_hour8 -1.661e-01 1.295e-01 -1.282 0.19981
dropoff_hour9 -1.920e-01 1.316e-01 -1.459 0.14453
dropoff_hour10 -2.201e-01 1.335e-01 -1.649 0.09926 .
dropoff_hour11 -2.411e-01 1.353e-01 -1.782 0.07482 .
dropoff_hour12 -2.362e-01 1.371e-01 -1.722 0.08502 .
dropoff_hour13 -2.251e-01 1.389e-01 -1.620 0.10520
dropoff_hour14 -2.290e-01 1.409e-01 -1.626 0.10401
dropoff_hour15 -2.349e-01 1.428e-01 -1.645 0.10001
dropoff_hour16 -1.734e-01 1.449e-01 -1.197 0.23141
dropoff_hour17 -1.268e-01 1.471e-01 -0.862 0.38844
dropoff_hour18 -1.087e-01 1.493e-01 -0.728 0.46669
dropoff_hour19 -1.043e-01 1.516e-01 -0.688 0.49135
dropoff_hour20 -1.453e-01 1.545e-01 -0.941 0.34682
dropoff_hour21 -1.909e-01 1.575e-01 -1.212 0.22569
dropoff_hour22 -2.303e-01 1.613e-01 -1.428 0.15324
dropoff_hour23 -2.111e-01 1.651e-01 -1.279 0.20101
ride_duration NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2784 on 34793 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9995
F-statistic: 6.103e+05 on 127 and 34793 DF, p-value: < 2.2e-16
Predict on the linear Model
lm.pred <- predict(lm.fit,testAmtX)
actuals_preds <- data.frame(cbind(actuals=testAmtY, predicted=lm.pred))
Create a confusion matrix showing model accuracy
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
actuals predicted
actuals 1.0000000 0.9997615
predicted 0.9997615 1.0000000
head(actuals_preds)
Print the RMSE, the closer to 0 the better
RMSE(lm.pred,testAmtY)
[1] 0.2788686
Print MAE, the closer to 0 the better
MAE(lm.pred,testAmtY)
[1] 0.1545079
Plot out linear model
ggplot(actuals_preds,aes(actuals_preds$predicted, actuals_preds$actual)) +
geom_point(color = "darkred", alpha = 0.5) +
geom_smooth(method=lm)+ ggtitle('Linear Regression ') +
ggtitle("Linear Regression: Prediction vs Test Data") +
xlab("Predecited Total Amount") +
ylab("Observed Total Amount")